Why reproducible research (in R)?

Some arguments…

Why Pagedown?

Formatting text as PDF is probably one of the most widespread standards in the scientific community, especially when it comes to submitting papers and similar documents. The traditional route to well-formatted, good-looking PDFs is usually LaTeX or Word. However, if you have spent hours debugging LaTeX code (or just getting it to run), you may be on the lookout for something new.

The fairly new pagedown R package takes a different approach. Its main purpose is still to create high-quality PDFs, but it does so by taking advantage of modern web technologies (HTML/CSS/JavaScript): you design a web page and then print it to PDF.

While web pages are usually single-page scrollable documents, pagedown uses the JavaScript library Paged.js, which paginates documents and adds elements like headers, footers and everything else a readable scientific paper needs. Additionally, pagedown documents are based on R Markdown. In our view, pagedown and the underlying technology may replace LaTeX in the long run.

Prerequisites

We assume that you use R on a day-to-day basis and may even have started to work with R Markdown. If you don’t know what R Markdown is, there are many great resources that you should use (e.g. watch this short video). An older template [see Bauer (2018); https://osf.io/q395s], on which this newer template is based, may provide a quick entry point to writing a reproducible paper with R Markdown and LaTeX.

Based on R Markdown, pagedown allows you to create custom, well-formatted (paged) HTML documents. For a comprehensive overview, watch this video, a recording of a talk introducing pagedown given by Yihui Xie (who developed the pagedown package together with Romain Lesur). If you are not in a video-watching mood, find the slides here.

Then…

# install.packages("remotes")  # needed once for install_github()
remotes::install_github('rstudio/pagedown')
install.packages(c("rmarkdown", "knitr", "kableExtra",
                   "stargazer", "modelsummary"))
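After installation it can help to confirm that everything is actually available before compiling. The following is only a minimal sketch: the package list mirrors the install call above, and the variable names (pkgs, installed, missing_pkgs) are chosen for illustration.

```r
# Packages this template relies on (taken from the install call above)
pkgs <- c("pagedown", "rmarkdown", "knitr", "kableExtra",
          "stargazer", "modelsummary")

# requireNamespace() reports TRUE/FALSE without attaching the package
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are available.")
}
```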

Basics: Input files, output files and the YAML header

All the files you need to produce the present PDF file are:

  1. the input files:
  2. the “styling” files:

Basically, these are the files you need to specify in the YAML header of your .rmd file so that R, and ultimately pagedown, applies the style you want for your document. By using our templates, you will create a document that has the “look” of a working paper.

Take paper.rmd (the underlying R Markdown file of this PDF) and have a look at the YAML header (lines #18 - #22) to see how to specify these files. Basically, within the jss_paged function we additionally specify that we want to use custom CSS and custom HTML.

Download these files and save them into a folder. Close R/RStudio and directly open paper_pagedown.rmd with RStudio. Doing so ensures that the working directory is set to the folder that contains paper.rmd and the other files.4

Once you run/compile the paper.rmd file in RStudio, it creates an output file called paper_pagedown.html.

By using pagedown’s chrome_print function in the YAML header (line #25), your HTML-based web page will be printed to paper_pagedown.pdf (the one you are reading right now).

Both outputs will be saved in your working directory.
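The two compilation steps described above can also be run from the R console instead of the RStudio Knit button. This is a rough sketch, not the template's own mechanism: compile_paper is a hypothetical helper name, and chrome_print() assumes a local Chrome or Chromium installation. If chrome_print is already set in the YAML header, the explicit call here would be redundant.

```r
# Hypothetical helper: render the .rmd to HTML, then print the HTML to PDF
compile_paper <- function(rmd_file = "paper.rmd") {
  stopifnot(file.exists(rmd_file))
  # render() returns the path of the output file it created
  html_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  # chrome_print() drives headless Chrome/Chromium to produce the PDF
  pdf_file <- pagedown::chrome_print(html_file)
  c(html = html_file, pdf = pdf_file)
}

# Usage (run from the folder that contains the .rmd file):
# compile_paper("paper.rmd")
```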

Referencing within your document

To see how referencing works, simply look at the examples for figures, tables and sections below. For instance, in Section @ref(sec:tables) you can find different ways of referencing tables. The code of the underlying paper.rmd shows how I referenced Section @ref(sec:tables) right here, namely with ‘Section \@ref(sec:tables).’

Software versioning

Software changes and gets updated, especially with an active developer community like that of R. Luckily, you can always access old versions of R and old versions of R packages in the archive. In the archive you need to choose a particular package, e.g., dplyr, and search for the right version, e.g., dplyr_0.2.tar.gz. Then insert the path in the following function: install.packages("https://....../dplyr_0.2.tar.gz", repos=NULL, type="source"). Ideally, however, results will simply be reproducible in the most current R and package versions.
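As a rough illustration of the archive route, the sketch below builds the download path for a given package and version. The helper archive_url is hypothetical, and the URL pattern is an assumption about how the CRAN source archive is laid out; check it in a browser before relying on it. The install calls are commented out so that nothing is installed by accident.

```r
# Hypothetical helper; the URL layout is an assumption about the CRAN archive
archive_url <- function(pkg, version) {
  sprintf("https://cran.r-project.org/src/contrib/Archive/%s/%s_%s.tar.gz",
          pkg, pkg, version)
}

url <- archive_url("dplyr", "0.2")
url

# Install from source (needs build tools for packages with compiled code):
# install.packages(url, repos = NULL, type = "source")
# Alternative via the remotes package:
# remotes::install_version("dplyr", version = "0.2")
```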

I recommend using the command below and simply adding its output to the appendix, as I did here in Appendix @ref(sec:rsessioninfo). This ensures that you always report the package versions used in the last compilation of your paper. For more advanced tools see packrat.

cat(paste("#", capture.output(sessionInfo()), "\n", collapse ="")) 
  # or use message() instead of cat()
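If you would rather keep the session info as a separate file next to the paper, a minimal sketch could look like this. The file name session_info.txt is an arbitrary choice, and tempdir() is used here only so the example does not write into your project folder.

```r
# Write the session info to a plain-text file
info_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), info_file)

# Peek at the first lines to confirm the file was written
head(readLines(info_file), 3)
```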

Data

Import

Generally, code is evaluated by inserting regular R Markdown blocks.

x <- 1:10
x
##  [1]  1  2  3  4  5  6  7  8  9 10

Below we import an exemplary dataset (download).

data <- read.csv("data.csv")
head(data)
##   X speed dist
## 1 1     4    2
## 2 2     4   10
## 3 3     7    4
## 4 4     7   22
## 5 5     8   16
## 6 6     9   10

Putting your entire data into the .rmd file

Applying the function dput() to an object gives you the code needed to reproduce that object. So you could paste that code into your .rmd file if you don’t want to have extra data files. This makes sense where data files are small.

dput(data)
## structure(list(X = 1:50, speed = c(4L, 4L, 7L, 7L, 8L, 9L, 10L, 
## 10L, 10L, 11L, 11L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 14L, 
## 14L, 14L, 14L, 15L, 15L, 15L, 16L, 16L, 17L, 17L, 17L, 18L, 18L, 
## 18L, 18L, 19L, 19L, 19L, 20L, 20L, 20L, 20L, 20L, 22L, 23L, 24L, 
## 24L, 24L, 24L, 25L), dist = c(2L, 10L, 4L, 22L, 16L, 10L, 18L, 
## 26L, 34L, 17L, 28L, 14L, 20L, 24L, 28L, 26L, 34L, 34L, 46L, 26L, 
## 36L, 60L, 80L, 20L, 26L, 54L, 32L, 40L, 32L, 40L, 50L, 42L, 56L, 
## 76L, 84L, 36L, 46L, 68L, 32L, 48L, 52L, 56L, 64L, 66L, 54L, 70L, 
## 92L, 93L, 120L, 85L)), class = "data.frame", row.names = c(NA, 
## -50L))

You can then insert the dput output in your .rmd as below.

data <- structure(list(X = 1:50, speed = c(4L, 4L, 7L, 7L, 8L, 9L, 10L, 
10L, 10L, 11L, 11L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 14L, 
14L, 14L, 14L, 15L, 15L, 15L, 16L, 16L, 17L, 17L, 17L, 18L, 18L, 
18L, 18L, 19L, 19L, 19L, 20L, 20L, 20L, 20L, 20L, 22L, 23L, 24L, 
24L, 24L, 24L, 25L), dist = c(2L, 10L, 4L, 22L, 16L, 10L, 18L, 
26L, 34L, 17L, 28L, 14L, 20L, 24L, 28L, 26L, 34L, 34L, 46L, 26L, 
36L, 60L, 80L, 20L, 26L, 54L, 32L, 40L, 32L, 40L, 50L, 42L, 56L, 
76L, 84L, 36L, 46L, 68L, 32L, 48L, 52L, 56L, 64L, 66L, 54L, 70L, 
92L, 93L, 120L, 85L)), 
class = "data.frame", row.names = c(NA, 
-50L))
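To check that the pasted dput() code really reproduces the original object, you can do a quick round trip. The sketch below uses the built-in cars data set (which has the same speed and dist columns as the example data) so that it runs without data.csv. The names code and rebuilt are illustrative.

```r
# Capture the dput() output as text, as if it had been pasted into the .rmd
code <- paste(capture.output(dput(cars)), collapse = "\n")

# Evaluate that text to rebuild the object
rebuilt <- eval(parse(text = code))

# Compare the rebuilt object with the original
all.equal(rebuilt, cars)
```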

Tables

Producing good tables and referencing them within an R Markdown PDF used to be a hassle but has become much easier. Examples that you may use are shown below. The way you reference tables differs slightly between packages, e.g., for stargazer the label is set in the function call, for kable it is taken from the chunk name.

Tables with kable() and kable_styling()

A great function is kable() (knitr package) in combination with kableExtra. Table @ref(tab:table-2) provides an example. To reference the table produced by a chunk, prefix the chunk name with “tab:”, i.e., a chunk named table-2 becomes “tab:table-2”, and reference it by writing “Table \@ref(tab:table-2)” in your text.

library(knitr)
library(kableExtra)

kable(mtcars[1:10,], row.names = TRUE, 
      caption = 'Table with kable() and kable_styling()', 
      format = "html", booktabs = TRUE) %>%
        kable_styling(full_width = TRUE, 
                      latex_options = c("striped", 
                                        "scale_down",
                                        "HOLD_position"),
                      font_size = 10)
Table with kable() and kable_styling()
                     mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
Mazda RX4           21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
Mazda RX4 Wag       21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
Datsun 710          22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
Hornet 4 Drive      21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
Hornet Sportabout   18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
Valiant             18.1    6  225.0  105  2.76  3.460  20.22   1   0     3     1
Duster 360          14.3    8  360.0  245  3.21  3.570  15.84   0   0     3     4
Merc 240D           24.4    4  146.7   62  3.69  3.190  20.00   1   0     4     2
Merc 230            22.8    4  140.8   95  3.92  3.150  22.90   1   0     4     2
Merc 280            19.2    6  167.6  123  3.92  3.440  18.30   1   0     4     4

Tables with modelsummary

The modelsummary package provides a variety of tables and plots to summarize statistical models and data in R. Its plots and tables are highly customizable and can be saved to almost all formats, e.g., HTML, PDF and Markdown. This makes it especially easy to embed them in dynamic documents. Please look at the package’s extensive documentation, which also shows examples for almost any plot or table you might be looking for. In this template we demonstrate an example for modelsummary’s datasummary function. datasummary creates frequency tables, crosstab tables, correlation tables, balance tables and many more.
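To illustrate saving a table to a file, here is a minimal sketch that writes a small regression table to HTML through the output argument of modelsummary(). The model, the file name regression_table.html and the use of tempdir() are illustrative assumptions. The call is guarded so the snippet does nothing if modelsummary is not installed.

```r
# Target file; the extension decides the output format
out_file <- file.path(tempdir(), "regression_table.html")

if (requireNamespace("modelsummary", quietly = TRUE)) {
  # A small example model on the built-in swiss data
  fit <- lm(Fertility ~ Education, data = swiss)
  # output = <file path> writes the table to disk instead of printing it
  modelsummary::modelsummary(list("M1" = fit), output = out_file)
}
```

Changing the file extension (for example to .md) should switch the format, but check the package documentation for the extensions your installed version supports.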

Summarize numeric variables

Table @ref(tab:table-3) shows a summary table for numeric variables.

library(modelsummary)
## Warning: package 'modelsummary' was built under R version 4.0.5
datasummary_skim(mtcars, 
                 type="numeric", 
                 histogram=TRUE, 
                 title = "Summary: Numeric variables")
Summary: Numeric variables
        Unique (#)  Missing (%)   Mean     SD   Min  Median    Max
mpg             25            0   20.1    6.0  10.4    19.2   33.9
cyl              3            0    6.2    1.8   4.0     6.0    8.0
disp            27            0  230.7  123.9  71.1   196.3  472.0
hp              22            0  146.7   68.6  52.0   123.0  335.0
drat            22            0    3.6    0.5   2.8     3.7    4.9
wt              29            0    3.2    1.0   1.5     3.3    5.4
qsec            30            0   17.8    1.8  14.5    17.7   22.9
vs               2            0    0.4    0.5   0.0     0.0    1.0
am               2            0    0.4    0.5   0.0     0.0    1.0
gear             3            0    3.7    0.7   3.0     4.0    5.0
carb             6            0    2.8    1.6   1.0     2.0    8.0

Summarize categorical variables

Table @ref(tab:table-4) shows a summary table for categorical variables.

# Create categorical variables
mtcars$vs_cat <- as.logical(mtcars$vs)
mtcars$cyl_cat <- as.factor(mtcars$cyl)
datasummary_skim(mtcars, 
                 type="categorical", 
                 title = "Summary: Categorical variables")
Summary: Categorical variables
                   N     %
vs_cat   FALSE    18  56.2
         TRUE     14  43.8
cyl_cat  4        11  34.4
         6         7  21.9
         8        14  43.8

Regression table

Table @ref(tab:table-5) shows the output for a regression table. Make sure you name all your models and explicitly refer to model names (M1, M2 etc.) in the text.

library(modelsummary)
M1 <- lm(Fertility ~ Education + Agriculture, data = swiss)
M2 <- lm(Fertility ~ Education + Catholic, data = swiss)
M3 <- lm(Fertility ~ Education + Infant.Mortality + Agriculture, data = swiss)
models <- list("M1" = M1, "M2" =  M2, "M3" = M3)

modelsummary(models, title = 'A regression table.')
A regression table.
                          M1        M2        M3
(Intercept)           84.080    74.234    51.101
                     (5.782)   (2.352)  (10.995)
Education             -0.963    -0.788    -0.857
                     (0.189)   (0.129)   (0.173)
Agriculture           -0.066              -0.026
                     (0.080)             (0.073)
Catholic                         0.111
                               (0.030)
Infant.Mortality                           1.493
                                         (0.439)
Num.Obs.                  47        47        47
R2                     0.449     0.575     0.566
R2 Adj.                0.424     0.555     0.536
AIC                    349.7     337.6     340.5
BIC                    357.1     345.0     349.7
Log.Lik.            -170.846  -164.782  -165.243
F                     17.945    29.705    18.699

Inline code & results

Reproducibility reaches new heights when you work with inline code. For instance, you can automate the display of certain coefficients within the text. An example is to include estimates, e.g., the coefficient of Education in model M1 that we ran above. `r round(coef(M1)[2], 2)` will insert the coefficient as follows: -0.96. Or `r 3 + 7` will insert a 10 in the text.
Inline code/results that depend on earlier objects in your document are automatically updated once you change those objects. For instance, imagine a reviewer asks you to omit certain observations from your sample. You can simply do so at the beginning of your code and then recompile. At times you might have to set cache = FALSE at the beginning so that all the code chunks are rerun.
Researchers often avoid referring to results in the text because it is easy to forget to update them when revising a manuscript. At the same time, doing so can make an article much more informative and easier to read, e.g., if you discuss a coefficient in the text you can show it directly in the section in which you discuss it. Inline code lets you do just that, and R Markdown lets you do so in a reproducible and automated manner.
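A slightly more robust variant of the inline example is to pick the coefficient by name rather than by position, so the text does not silently change meaning if you reorder the predictors. This is only a sketch; the object name b_edu is illustrative.

```r
# Same model as M1 in the regression table
M1 <- lm(Fertility ~ Education + Agriculture, data = swiss)

# Select the coefficient by name instead of by position
b_edu <- round(coef(M1)[["Education"]], 2)
b_edu

# In the text of the .rmd file you would then write: `r b_edu`
```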

Figures

R base graphs

Inserting figures can be slightly more complicated. Ideally, we would produce and insert them directly in the .rmd file. It’s relatively simple to insert R base graphs as you can see in Figure @ref(fig:fig-1).

plot(cars$speed, cars$dist)
[Figure: Scatterplot of Speed and Distance]

But it turns out that it doesn’t always work so well.
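When direct insertion causes trouble, one possible fallback is to write the figure to a file first and include that file afterwards. The sketch below is an assumption about a workable setup, not the template's own approach: the file name fig-speed-dist.png and the dimensions are arbitrary, and the PNG device is only used if it is available on your system.

```r
# Save the base R scatterplot to a PNG file
fig_path <- file.path(tempdir(), "fig-speed-dist.png")

if (capabilities("png")) {
  png(fig_path, width = 800, height = 600)
  plot(cars$speed, cars$dist,
       xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
  dev.off()
}

# Inside a chunk of the .rmd file the saved image could then be included with:
# knitr::include_graphics(fig_path)
```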

ggplot2 graphs

The same is true for ggplot2, as you can see in Figure @ref(fig:fig-2).

library(ggplot2)

mtcars$cyl <- as.factor(mtcars$cyl) # Convert cyl to factor

ggplot(mtcars, aes(x=wt, y=mpg, shape=cyl)) + geom_point() +
  labs(x="Weight (lb/1000)", y = "Miles/(US) gallon", 
       shape="Number of \n Cylinders") + theme_classic()
[Figure: Miles per gallon according to the weight]

Interactive graphs

Compiling the document

To view your paper, pagedown requires a web server (since it is based on Paged.js)5. When you compile a document, RStudio displays your HTML page through a local web server, i.e., Paged.js will work in the RStudio Viewer.

There are several options, depending on your intention:

Good practices for reproducibility

Every researcher has their own optimized setup. Currently we would recommend the following:

Additional tricks for publishing

Citation styles

If your study needs to follow a particular citation style, you can set the corresponding style in the header of your .rmd document. To do so you have to download the corresponding .csl file.

In the present document we use the style of the American Sociological Association and set it in the preamble with csl: american-sociological-association.csl. However, you also need to download the respective .csl file from the following GitHub page: https://github.com/citation-style-language/styles and copy it into your working directory for it to work.

The GitHub directory contains a wide variety of citation style files, covering many disciplines.
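Fetching the style file can also be scripted. The following sketch only builds the download address. The helper csl_url is hypothetical, and the raw-file URL pattern (including the master branch name) is an assumption about how GitHub serves files from that repository, so verify it before use. The actual download is left commented out.

```r
# Hypothetical helper; the raw URL layout and branch name are assumptions
csl_url <- function(style) {
  paste0("https://raw.githubusercontent.com/",
         "citation-style-language/styles/master/", style, ".csl")
}

style <- "american-sociological-association"
url <- csl_url(style)
url

# Save the file into the working directory, next to the .rmd file:
# download.file(url, destfile = paste0(style, ".csl"))
```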

References

Bauer, Paul. 2018. “Writing a Reproducible Paper in R Markdown.” Open Science Framework Preprint, December, 1–14.
Kirsop, Barbara, and Leslie Chan. 2005. “Transforming Access to Research Literature for Developing Countries.” Serials Review 31 (4): 246–55.
R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
RStudio Team. 2015. RStudio: Integrated Development Environment for R. Boston, MA: RStudio, Inc. http://www.rstudio.com/.
Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman and Hall/CRC. http://www.crcpress.com/product/isbn/9781466561595.
———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman and Hall/CRC. https://yihui.name/knitr/.
———. 2016. Bookdown: Authoring Books and Technical Documents with R Markdown. Boca Raton, Florida: Chapman and Hall/CRC. https://github.com/rstudio/bookdown.
———. 2017. Bookdown: Authoring Books and Technical Documents with R Markdown. https://github.com/rstudio/bookdown.
———. 2018. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.name/knitr/.
———. 2021. Xaringan: Presentation Ninja. https://CRAN.R-project.org/package=xaringan.
Xie, Yihui, and Romain Lesur. 2021. “Pagedown: Create Paged HTML Documents for Printing from R Markdown.” https://rstudio.github.io/pagedown/.
Xie, Yihui, Romain Lesur, Brent Thorne, and Xianying Tan. 2021. Pagedown: Paginate the HTML Output of R Markdown with CSS for Print. https://CRAN.R-project.org/package=pagedown.
Zhu, Hao. 2017. kableExtra: Construct Complex Table with ’Kable’ and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.

Online appendix

Attach R session info in appendix

Since R and R packages are constantly evolving you might want to add the R session info that contains information on the R version as well as the packages that are loaded.

## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] plotly_4.9.3       ggplot2_3.3.3      modelsummary_0.8.1 kableExtra_1.3.4  
## [5] knitr_1.33        
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.6         svglite_2.0.0      tidyr_1.1.3        ps_1.6.0          
##  [5] assertthat_0.2.1   digest_0.6.27      utf8_1.2.1         R6_2.5.0          
##  [9] backports_1.2.1    evaluate_0.14      httr_1.4.2         highr_0.9         
## [13] pillar_1.6.1       rlang_0.4.11       lazyeval_0.2.2     rstudioapi_0.13   
## [17] data.table_1.14.0  jquerylib_0.1.4    checkmate_2.0.0    rmarkdown_2.8.6   
## [21] labeling_0.4.2     webshot_0.5.2      servr_0.22         stringr_1.4.0     
## [25] htmlwidgets_1.5.3  munsell_0.5.0      broom_0.7.6        compiler_4.0.4    
## [29] httpuv_1.6.1       xfun_0.23          pkgconfig_2.0.3    systemfonts_1.0.2 
## [33] htmltools_0.5.1.1  websocket_1.4.0    tidyselect_1.1.1   tibble_3.1.1      
## [37] fansi_0.4.2        viridisLite_0.4.0  crayon_1.4.1       dplyr_1.0.6       
## [41] withr_2.4.2        later_1.2.0        tables_0.9.6       grid_4.0.4        
## [45] jsonlite_1.7.2     gtable_0.3.0       lifecycle_1.0.0    DBI_1.1.1         
## [49] magrittr_2.0.1     scales_1.1.1       stringi_1.5.3      farver_2.1.0      
## [53] promises_1.2.0.1   xml2_1.3.2         bslib_0.2.5        ellipsis_0.3.2    
## [57] vctrs_0.3.8        generics_0.1.0     RColorBrewer_1.1-2 tools_4.0.4       
## [61] glue_1.4.2         purrr_0.3.4        crosstalk_1.1.1    processx_3.5.2    
## [65] yaml_2.2.1         colorspace_2.0-1   rvest_1.0.0        pagedown_0.14     
## [69] sass_0.4.0

All the code in the paper

To attach all the code used in the PDF file to the appendix, see the R chunk in the underlying .rmd file:

knitr::opts_chunk$set(cache = FALSE)
# Use cache = TRUE if you want to speed up compilation
# A function to allow for showing some of the inline code
rinline <- function(code){
  html <- '<code  class="r">``` `r CODE` ```</code>'
  sub("CODE", code, html)
}
remotes::install_github('rstudio/pagedown')
install.packages(c("rmarkdown", "knitr", "kableExtra",
                   "stargazer", "modelsummary"))
cat(paste("#", capture.output(sessionInfo()), "\n", collapse ="")) 
  # or use message() instead of cat()
x <- 1:10
x
data <- read.csv("data.csv")
head(data)
dput(data)
data <- structure(list(X = 1:50, speed = c(4L, 4L, 7L, 7L, 8L, 9L, 10L, 
10L, 10L, 11L, 11L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 14L, 
14L, 14L, 14L, 15L, 15L, 15L, 16L, 16L, 17L, 17L, 17L, 18L, 18L, 
18L, 18L, 19L, 19L, 19L, 20L, 20L, 20L, 20L, 20L, 22L, 23L, 24L, 
24L, 24L, 24L, 25L), dist = c(2L, 10L, 4L, 22L, 16L, 10L, 18L, 
26L, 34L, 17L, 28L, 14L, 20L, 24L, 28L, 26L, 34L, 34L, 46L, 26L, 
36L, 60L, 80L, 20L, 26L, 54L, 32L, 40L, 32L, 40L, 50L, 42L, 56L, 
76L, 84L, 36L, 46L, 68L, 32L, 48L, 52L, 56L, 64L, 66L, 54L, 70L, 
92L, 93L, 120L, 85L)), 
class = "data.frame", row.names = c(NA, 
-50L))
library(knitr)
library(kableExtra)

kable(mtcars[1:10,], row.names = TRUE, 
      caption = 'Table with kable() and kable_styling()', 
      format = "html", booktabs = TRUE) %>%
        kable_styling(full_width = TRUE, 
                      latex_options = c("striped", 
                                        "scale_down",
                                        "HOLD_position"),
                      font_size = 10)


library(modelsummary)
datasummary_skim(mtcars, 
                 type="numeric", 
                 histogram=TRUE, 
                 title = "Summary: Numeric variables")
# Create categorical variables
mtcars$vs_cat <- as.logical(mtcars$vs)
mtcars$cyl_cat <- as.factor(mtcars$cyl)
datasummary_skim(mtcars, 
                 type="categorical", 
                 title = "Summary: Categorical variables")
library(modelsummary)
M1 <- lm(Fertility ~ Education + Agriculture, data = swiss)
M2 <- lm(Fertility ~ Education + Catholic, data = swiss)
M3 <- lm(Fertility ~ Education + Infant.Mortality + Agriculture, data = swiss)
models <- list("M1" = M1, "M2" =  M2, "M3" = M3)

modelsummary(models, title = 'A regression table.')
plot(cars$speed, cars$dist)
library(ggplot2)

mtcars$cyl <- as.factor(mtcars$cyl) # Convert cyl to factor

ggplot(mtcars, aes(x=wt, y=mpg, shape=cyl)) + geom_point() +
  labs(x="Weight (lb/1000)", y = "Miles/(US) gallon", 
       shape="Number of \n Cylinders") + theme_classic()

library(plotly)

mtcars$cyl <- as.factor(mtcars$cyl) # Convert cyl to factor
mtcars %>%
  plot_ly(x=~wt,y=~mpg,color=~cyl) %>%
  add_markers() %>% 
  layout(xaxis=list(title="Weight (lb/1000)"), yaxis=list(title="Miles/(US) gallon"))
print(sessionInfo(), locale = FALSE)

  1. Based on an earlier R Markdown template that uses LaTeX and can be downloaded at https://osf.io/q395s (see Bauer 2018).

  2. Corresponding address:

  3. You can download various citation style files from this webpage: https://github.com/citation-style-language/styles.

  4. You can always check your working directory in R with getwd().

  5. Paged.js is an open-source library to paginate content in the browser.

  6. Another good folder setup would be to store all files needed as input files for the R Markdown manuscript in a subfolder called “input” and all output files that are produced apart from paper.html and paper.pdf in a subfolder called “output.”